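Existing abstractive summarization models lack explicit control mechanisms that would allow users to influence the stylistic features of the model outputs. As a result, they generate generic summaries that do not cater to users' needs or preferences. To address this issue, we introduce HydraSum, a new summarization architecture that extends the single-decoder framework of current models, e.g., BART, to a mixture-of-experts version consisting of multiple decoders. Our proposed model encourages each expert, i.e., each decoder, to learn and generate stylistically distinct summaries along dimensions such as abstractiveness, length, and specificity. At each time step, HydraSum employs a gating mechanism that decides the contribution of each individual decoder to the output probability distribution of the next token. Through experiments on three summarization datasets (CNN, Newsroom, XSum), we demonstrate that this gating mechanism automatically learns to assign contrasting summary styles to different HydraSum decoders under the standard training objective, without the need for additional supervision. We further show that a guided version of the training process can explicitly govern which summary styles are partitioned between decoders, e.g., high abstractiveness vs. low abstractiveness or high specificity vs. low specificity, and can also increase the stylistic difference between individual decoders. Finally, our experiments demonstrate that our decoder framework is highly flexible: during inference, we can sample from individual decoders or from mixtures of different subsets of the decoders to yield a diverse set of summaries and to enforce single- and multi-style control over summary generation.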
translated by 谷歌翻译
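The gating step described in the HydraSum abstract can be illustrated with a minimal sketch. It assumes that the gate produces one score per decoder and that the decoders' next-token distributions are mixed in probability space; the abstract does not spell out these details, so the function names (`softmax`, `mix_decoders`) and the `gate_override` parameter are hypothetical and not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_decoders(decoder_logits, gate_logits, gate_override=None):
    """Combine K decoders' next-token distributions with gate weights.

    decoder_logits: K lists of vocabulary logits, one list per decoder.
    gate_logits:    K gate scores for the current time step.
    gate_override:  optional fixed weights (assumed to sum to 1) used at
                    inference time, e.g. [1.0, 0.0] to use only decoder 0.
    """
    gate = gate_override if gate_override is not None else softmax(gate_logits)
    dists = [softmax(logits) for logits in decoder_logits]
    vocab_size = len(dists[0])
    # Weighted sum of the per-decoder probability distributions.
    return [sum(g * d[v] for g, d in zip(gate, dists)) for v in range(vocab_size)]

# Two toy decoders over a three-token vocabulary.
logits = [[2.0, 0.5, 0.1], [0.1, 0.5, 2.0]]
mixed = mix_decoders(logits, gate_logits=[0.0, 0.0])           # equal gate weights
only_first = mix_decoders(logits, [0.0, 0.0], [1.0, 0.0])      # single-decoder control
```

Passing a fixed `gate_override` mirrors the inference-time control mentioned in the abstract, where output is drawn from one decoder or from a chosen subset instead of the learned mixture.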
We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in the fields of computer vision and language, and build on it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and in gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset pose a unique set of challenges for summarization systems, which include processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.
Sentiment analysis, also called opinion mining, is a representative application of natural language processing (NLP) and has been among the most significant topics in recent years. The goal of this study is to address the challenges of sentiment polarity classification in sentiment analysis. A general technique for categorizing sentiment polarity is presented, along with detailed explanations of the process. Using the results of the analysis, both sentence-level and review-level classification are conducted. Finally, we discuss our plans for future sentiment analysis research.
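The co-adaptation of robots has been a long-standing research endeavour that aims to adapt both the body and the behaviour of a system to a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering and to improve system performance. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often requires significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end, we propose a co-imitation methodology for adapting behaviour and morphology by matching the state distributions of the demonstrator. Specifically, we focus on the challenging scenario in which the state and action spaces of the two agents do not match. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and we demonstrate co-imitation by transferring human walking, jogging, and kicking skills onto a simulated humanoid.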
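Fake news has existed for as long as there has been news, from rumors to print media to radio and television. Recently, the information age, with its breakthroughs in communications and the Internet, has exacerbated the spread of fake news. In addition, aside from e-commerce, the current Internet economy depends on advertisements, views, and clicks, which has prompted many developers to bait end users into clicking links or ads. Consequently, the wild spread of fake news through social media networks has affected real-world issues, from elections to 5G adoption and the handling of the Covid-19 pandemic. Efforts to detect and thwart fake news have existed since its advent, from fact checkers to artificial-intelligence-based detectors. Solutions are still evolving as fake news propagators employ more sophisticated techniques. In this paper, we use R code to study and visualize a modern fake news dataset. We use clustering, classification, correlation, and various plots to analyze and present the data. The experiments show that classifiers are highly effective at telling real news apart from fake news.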
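Twin support vector machine (TWSVM) and twin support vector regression (TSVR) are emerging, efficient machine learning techniques that offer promising solutions for classification and regression problems, respectively. TWSVM is based on the idea of identifying two nonparallel hyperplanes that classify the data points into their respective classes. It requires solving two small quadratic programming problems (QPPs) instead of the single large QPP solved in a support vector machine (SVM), while TSVR is formulated along the lines of TWSVM and requires solving two SVM-type problems. Although there has been good research progress on these techniques, the literature comparing different variants of TSVR is limited. This review therefore presents a rigorous analysis of recent research on TWSVM and TSVR, while also noting their limitations and advantages. We first introduce the basic theory of the support vector machine and TWSVM, then focus on the various improvements and applications of TWSVM, and then introduce TSVR and its various enhancements. Finally, we suggest prospects for future research and development.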
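The role of the two nonparallel hyperplanes in TWSVM can be illustrated with a small sketch of the usual decision rule, in which a point is assigned to the class whose hyperplane lies nearer. The sketch assumes the two hyperplanes have already been obtained by solving the two QPPs; the hyperplane values and the function names (`plane_distance`, `twsvm_predict`) are hypothetical and chosen only for illustration.

```python
import math

def plane_distance(x, w, b):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def twsvm_predict(x, plane_pos, plane_neg):
    """Return +1 if x is at least as close to the positive-class plane, else -1.

    plane_pos, plane_neg: (w, b) pairs, assumed to come from the two QPPs.
    """
    d_pos = plane_distance(x, *plane_pos)
    d_neg = plane_distance(x, *plane_neg)
    return 1 if d_pos <= d_neg else -1

# Hypothetical, hand-picked hyperplanes in 2D (not fitted to any data).
plane_pos = ([1.0, 0.0], -1.0)   # the line x1 = 1
plane_neg = ([0.0, 1.0], 1.0)    # the line x2 = -1

label_a = twsvm_predict([1.0, 3.0], plane_pos, plane_neg)
label_b = twsvm_predict([5.0, -1.0], plane_pos, plane_neg)
```

Because each hyperplane is fitted to stay close to one class, the two planes need not be parallel, which is the main contrast with the single separating hyperplane of a standard SVM.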